Give a brief description of your project and its goal(s), what data you are using to complete it, and what three faculty/staff in different fields you have spoken to about your project, with a brief summary of what you learned from each person. Include a link to your final project GitHub repository.
This project considers the burdens of poor air quality and its health consequences, and how they fall on different American communities. To explore these relationships, I use daily, site-level testing data for fine particulate matter (PM2.5) from the EPA’s Air Quality System (AQS) for 2019, linked with census data for the tract surrounding each testing site. I am still determining how best to incorporate data on the near-term health effects of poor air quality. My main goal, however, is not to use air quality to predict those health outcomes, but to determine whether there are differences (e.g., in extent or significance) between how community characteristics predict air quality and how those same characteristics predict the prevalence of events such as emergency department presentations for asthma exacerbation. In developing a plan for this project I met with Drs. Anil Vachani, Gary Weissman, and John P. Reilly, who provided insights on the causal pathway from pollution sources to specific EPA-measured pollutants to short- and long-term health outcomes; potential data sources and analytic approaches; the ways in which “natural experiments” around major changes in pollutant generation have been used to compare the effects of perturbation from a steady state; and more. Materials for this project can be found in this GitHub repository.
Describe the problem addressed, its significance, and some background to motivate the problem. Explain why your problem is interdisciplinary, what fields can contribute to its understanding, and incorporate background related to what you learned from meeting with faculty/staff.
Poor air quality can have a substantial negative influence on health and quality of life, but different people face different resources, incentives, and tradeoffs when deciding where to live. We might therefore expect the relationship between who lives where and air quality to be more complex than people with greater wealth, income, and social power simply concentrating in places with better air quality. By considering many different community characteristics in models that predict poor air quality, we can learn which factors are most strongly associated with air quality, and how successfully models trained on community characteristics can predict poor air quality and related health events.
Describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why.
I begin by defining a function that creates a data frame from the results of a GET call to the EPA’s Air Quality System (AQS) API. I also create objects with relevant reference lists and define a vector of state FIPS codes for the “lower 48” U.S. states plus Washington, D.C. for later use.
# Retrieve environment variables with EPA AQS API credentials
api.email <- Sys.getenv("EPA_AQS_EMAIL")
api.key <- Sys.getenv("EPA_AQS_KEY")
# Function to create a data frame based on an API call
download.AQS <- function(whichData, customParams) {
  rootpath <- "https://aqs.epa.gov/data/api/"
  morepath <- paste0(rootpath, whichData)
  fullParams <- append(list(email = api.email, key = api.key), customParams)
  step1.json <- httr::GET(url = morepath, query = fullParams)
  step2.parsed <- jsonlite::fromJSON(httr::content(step1.json, "text"), simplifyVector = FALSE)
  step3.df <- tibble(Data = step2.parsed$Data) %>% unnest_wider(Data)
  return(step3.df)
}
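download.AQS assumes every request succeeds and that the response body parses cleanly. A defensive variant (my own sketch — `download.AQS.safe` is a hypothetical helper, not part of the pipeline above; it reuses the api.email/api.key objects already defined) might check the HTTP status and retry before failing:

```r
# Sketch only: status check plus a simple retry around the same GET call.
# download.AQS.safe is a hypothetical helper; reuses api.email/api.key from above.
download.AQS.safe <- function(whichData, customParams, tries = 3) {
  fullParams <- append(list(email = api.email, key = api.key), customParams)
  for (i in seq_len(tries)) {
    resp <- httr::GET(url = paste0("https://aqs.epa.gov/data/api/", whichData),
                      query = fullParams)
    if (httr::status_code(resp) == 200) {
      parsed <- jsonlite::fromJSON(httr::content(resp, "text"),
                                   simplifyVector = FALSE)
      return(tibble(Data = parsed$Data) %>% unnest_wider(Data))
    }
    Sys.sleep(5)  # brief pause before retrying
  }
  stop("AQS request failed after ", tries, " attempts: ", whichData)
}
```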
# Fetch documentation for daily data summaries, state code list
dailyData.dictionary <- download.AQS("metaData/fieldsByService",list(service = "dailyData"))
list.states <- download.AQS("list/states",list())
# Omitting state codes other than 48 states + D.C.
list.states.lower48 <- list.states %>% filter(!(code %in% c("02","15","66","72","78","80","CC")))
list.states.lower48
lower48.codes <- as.vector(as.matrix(list.states.lower48[,1]))

To manage downloading daily PM2.5 data for each state for all of 2019, I define another function that loops through the state FIPS codes for a given pair of date parameters. Executing this step with one call per month of data takes roughly 90 minutes, so I save the resulting data frame and load it in the chunk below.
# Pull daily PM2.5 (parameter 88101) data for these states, with arguments for date range
get.daily.lower48 <- function(b, e) {
  collector <- data.frame()
  for (s in seq_along(lower48.codes)) {
    newstate <- download.AQS("dailyData/byState",
                             list(param = "88101", bdate = b, edate = e,
                                  state = list.states.lower48[s,1]))
    collector <- bind_rows(collector, newstate)
  }
  return(collector)
}
# Run in monthly batches
dailyPM2.5_201901 <- get.daily.lower48("20190101","20190131")
dailyPM2.5_201902 <- get.daily.lower48("20190201","20190228")
dailyPM2.5_201903 <- get.daily.lower48("20190301","20190331")
dailyPM2.5_201904 <- get.daily.lower48("20190401","20190430")
dailyPM2.5_201905 <- get.daily.lower48("20190501","20190531")
dailyPM2.5_201906 <- get.daily.lower48("20190601","20190630")
dailyPM2.5_201907 <- get.daily.lower48("20190701","20190731")
dailyPM2.5_201908 <- get.daily.lower48("20190801","20190831")
dailyPM2.5_201909 <- get.daily.lower48("20190901","20190930")
dailyPM2.5_201910 <- get.daily.lower48("20191001","20191031")
dailyPM2.5_201911 <- get.daily.lower48("20191101","20191130")
dailyPM2.5_201912 <- get.daily.lower48("20191201","20191231")
# Concatenate monthly batch files into a single file for 2019
all.2019.dailyPM2.5 <- bind_rows( dailyPM2.5_201901
,dailyPM2.5_201902
,dailyPM2.5_201903
,dailyPM2.5_201904
,dailyPM2.5_201905
,dailyPM2.5_201906
,dailyPM2.5_201907
,dailyPM2.5_201908
,dailyPM2.5_201909
,dailyPM2.5_201910
,dailyPM2.5_201911
,dailyPM2.5_201912)
# Save full 2019 raw file
saveRDS(all.2019.dailyPM2.5, file = "all2019dailyPM25.RDS")

I test combinations of testing-site location identifiers and add a single site_id variable to the 2019 PM2.5 data set by concatenating location identifiers. I also create a data frame with one observation per testing site.
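A side note on the concatenation used to build site_id: the AQS geographic codes are zero-padded strings, and keeping them as character (never numeric) is what makes the concatenated ID unambiguous. A toy illustration with made-up codes:

```r
# Toy codes (not real sites): state (2 chars) + county (3) + site number (4)
state  <- c("06", "06", "42")
county <- c("079", "079", "101")
site   <- c("0005", "2005", "0005")
paste0(state, county, site)  # "060790005" "060792005" "421010005"
```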
# Load the full 2019 raw file
all.2019.dailyPM2.5 <- readRDS("all2019dailyPM25.RDS")
str(all.2019.dailyPM2.5)
# Create a file of all unique combinations of geographic indicators for testing sites
testing.sites <- all.2019.dailyPM2.5 %>%
dplyr::select(state_code, state,
county_code, county,
city, cbsa_code, cbsa,
site_number, local_site_name, site_address,
latitude, longitude) %>%
unique() %>%
arrange(state_code,county_code,site_number)
dim(testing.sites)
# 965 combinations
# Create a single unique ID column for site
testing.sites %>% dplyr::select(site_number) %>% unique() %>% dim()
testing.sites %>% dplyr::select(state_code,site_number) %>% unique() %>% dim()
testing.sites %>% dplyr::select(state_code,county_code,site_number) %>% unique() %>% dim()
# Only 253 unique site numbers. 770 state-site combos.
# State + county + site number *is* unique (965 combos)
clean.2019.dailyPM2.5 <- all.2019.dailyPM2.5 %>%
  mutate(site_id = paste0(state_code,county_code,site_number))

Using the tigris::tracts function, I retrieve the geometries, GEOIDs, and other spatial information for census tracts and join each AQS testing site to the data for the tract within which it lies, based on the latitude and longitude from AQS. I loop through the state FIPS codes one at a time and stack the results into a single crosswalk file. This step is time-consuming, so I save the resulting file and load it in the next step.
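Before the real spatial merge, here is a minimal toy illustration of the st_join/st_contains pattern used below (a unit-square “tract” and two made-up points; none of this is the real data):

```r
library(sf)
# One square polygon standing in for a census tract
square <- st_sf(GEOID = "toy01",
                geometry = st_sfc(st_polygon(list(rbind(c(0, 0), c(1, 0),
                                                        c(1, 1), c(0, 1),
                                                        c(0, 0)))),
                                  crs = 4269))
# Two points standing in for testing sites; only one falls inside the square
pts <- st_sf(site = c("inside", "outside"),
             geometry = st_sfc(st_point(c(0.5, 0.5)),
                               st_point(c(2, 2)),
                               crs = 4269))
# left = FALSE keeps only tract rows that actually contain a point
joined <- st_join(square, pts, join = st_contains, left = FALSE)
joined$site  # "inside"
```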
# Prepare for spatial merge. Create sf file from testing site coordinates
testing.sites.sf.points <- st_as_sf(testing.sites, coords = c('longitude','latitude'), crs = 4269)
# Download census tract sf shapefiles, one state at a time.
# Perform spatial joins one state at a time to avoid having to build the national tract shapefile.
for (s in seq_along(lower48.codes)) {
  newstate <- lower48.codes[s]
  newtracts <- tracts(newstate)
  newjoin <- st_join(newtracts, testing.sites.sf.points, join = st_contains, left = FALSE)
  if (s == 1) {
    testing.sites.tracts <- newjoin
  } else {
    testing.sites.tracts <- bind_rows(testing.sites.tracts, newjoin)
  }
}
# Save file with testing sites <> census tract
saveRDS(testing.sites.tracts, file = "testing_sites_tracts.RDS")

From tidycensus I now retrieve the variable list for the 2018 5-year American Community Survey (ACS). After reviewing the concepts represented by those variables as well as the questionnaire, I create an object with a set of variables to test for associations with the AQS data.
testing.sites.tracts <- readRDS("testing_sites_tracts.RDS")
head(testing.sites.tracts)
# Load a data frame with all 5-year ACS variables
census.5yr.vars <- load_variables(dataset = "acs5", year = 2018)
head(census.5yr.vars)
# Simplify the variable list to make it easier to review the major concepts represented
census.5yr.concept.counts <-
  census.5yr.vars %>%
  mutate(root.concept = str_split_fixed(concept," BY",2)[,1]) %>%
  mutate(root.concept = str_split_fixed(root.concept," \\(",2)[,1]) %>%
  # Double escape needed to match the open parenthesis in the regex!
  mutate(root.name = substring(name,1,6)) %>%
  group_by(root.name,root.concept) %>%
  summarise(obs=n(), .groups="keep") %>%
  arrange(root.name,-obs)
# View(sort(unique(census.5yr.concept.counts$root.concept)))
# After reviewing the variable list, create an object with the variables of interest
final.acs.var.list <- census.5yr.vars %>% filter(name %in% c(
"B01002_001" # median age
,"B03002_003","B03002_001" # Non-Hispanic White alone // Denom
,"B02009_001","B02001_001" # Black race, alone or in combination // Denom
,"B06008_003","B06008_001" # Now married // Denom
,"B23007_002","B23007_001" # Children under 18yrs in household // Denom
,"B15003_022","B15003_023","B15003_024","B15003_025","B15003_001" # Bachelors // Masters // Professional // Doctorate // Denom
,"B23022_027","B23022_026","B23022_003","B23022_002" # Worked in past 12 months-Female // Denom-Female // p12-Male // Denom-Male
,"B21001_002","B21001_001" # Veteran // Denom
,"B17001_002","B17001_001" # Past 12mo income at or below poverty level, Denom
,"B22010_002","B22010_001" # Received SNAP in past 12mo, Denom
,"B22008_001" # Median household income, past 12mo
,"B25008_003","B25008_001" # Renter occupied, Denom
,"B25024_002","B25024_003","B25024_001" # Single-fam, detached // single-fam, attached // Denom
,"B25064_001" # Median gross rent
,"B25088_002" # Median monthly owner costs for households with a mortgage
,"B08126_004","B08126_002","B08126_001" # INDUSTRY: Manufacturing // Agriculture, fishing, mining // Denom
,"B08006_003","B08006_001" # Drove to work alone // Denom
,"B08013_001" # Aggregate travel time to work in minutes
)) %>% arrange(name)
final.acs.var.list

Once more I loop through the state FIPS codes, retrieving the ACS variables of interest by tract as a flat file using tidycensus::get_acs and preserving data only for the census tracts that contain AQS testing sites. As this is another slow step, I save the resulting data frame.
head(testing.sites.tracts)
# Keep only what is needed to make the ACS calls in a small data frame
tst.minimal <- testing.sites.tracts %>%
mutate(site_id = paste0(state_code,county_code,site_number)) %>%
dplyr::select(site_id, GEOID, STATEFP, COUNTYFP)
st_geometry(tst.minimal) <- NULL
# Loop through existing list of state codes
lower48.codes
# Make ACS calls one state at a time, only for counties with AQS testing sites
# Preserve estimate columns, ditch MOE
# Join to testing site list
# Stack ACS data together in a single data frame
acs.loop <- function() {
  for (st in seq_along(lower48.codes)) {
    newstate <- lower48.codes[st]
    county.list <- tst.minimal %>%
      filter(STATEFP == newstate) %>%
      dplyr::select(COUNTYFP) %>%
      arrange(COUNTYFP) %>%
      as.matrix() %>% as.vector() %>% as.list()
    results.acs <- get_acs( geography = "tract"
                           ,variables = as.list(final.acs.var.list$name)
                           ,state = newstate
                           ,county = county.list
                           ,output = "wide")
    smaller.acs <- results.acs %>% dplyr::select(GEOID, matches(".*_.*E$"))
    linked.acs <- tst.minimal %>%
      filter(STATEFP == newstate) %>%
      left_join(., smaller.acs, by = "GEOID")
    if (st == 1) {
      collect <- linked.acs
    } else {
      collect <- bind_rows(collect, linked.acs)
    }
  }
  return(collect)
}
testing.sites.acs <- acs.loop()
testing.sites.acs
saveRDS(testing.sites.acs, file = "testing_sites_acs.RDS")

With all the ACS data I need, I can create clean versions of the variables to be included in the analysis. I exclude tracts that do not have complete data for all 19 ACS-based variables, or that have fewer than 500 respondents in the sample (using the denominator from the set of race variables).
testing.sites.acs <- readRDS("testing_sites_acs.RDS")
head(testing.sites.acs)
# Create percentage variables and scaled medians/totals to use in modeling
testing.sites.acs.ready <- testing.sites.acs %>%
dplyr::rename( age.median = B01002_001E
,income.hh.median = B22008_001E
,rent.median = B25064_001E
,home.pmt.median = B25088_002E
,commute.time.agg = B08013_001E) %>%
mutate( race.black.any = 100*B02009_001E/B02001_001E
,ethn.nhisp.white.alone = 100*B03002_003E/B03002_001E
,married = 100*B06008_003E/B06008_001E
,kids = 100*B23007_002E/B23007_001E
,bachelor.plus = 100*(B15003_022E + B15003_023E + B15003_024E + B15003_025E)/B15003_001E
,veteran = 100*B21001_002E/B21001_001E
,manufacturing = 100*B08126_004E/B08126_001E
,farm.fish.mining = 100*B08126_002E/B08126_001E
,commutes.car.alone = 100*B08006_003E/B08006_001E
,renter = 100*B25008_003E/B25008_001E
,single.fam.home = 100*(B25024_002E + B25024_003E)/B25024_001E
,worked.12mo = 100*(B23022_027E + B23022_003E)/(B23022_026E + B23022_002E)
,poverty.12mo = 100*B17001_002E/B17001_001E
,snap.12mo = 100*B22010_002E/B22010_001E
,denominator = B02001_001E) %>%
dplyr::select(-matches("^B[012].*"))
head(testing.sites.acs.ready)
summary(testing.sites.acs.ready)
# testing.sites.acs.ready %>% dplyr::filter(denominator < 1000) %>% View()
# testing.sites.acs.ready %>% dplyr::filter(denominator < 500 | !complete.cases(.)) %>% View()
length(unique(testing.sites.acs.ready$GEOID))
# 953 unique GEOIDs - the file is not unique by tract (some tracts contain multiple sites)
# 34 tracts to exclude based on incomplete data and total denominator
# Bring the aggregates and medians to a comparable scale
testing.sites.acs.complete <-
testing.sites.acs.ready %>%
mutate(commute.time.agg = commute.time.agg/10000
,income.hh.median = income.hh.median/10000
,rent.median = rent.median/100
,home.pmt.median = home.pmt.median/100
) %>%
dplyr::rename(commute.time.agg.10k = commute.time.agg
,income.hh.median.10k = income.hh.median
,rent.median.100 = rent.median
,home.pmt.median.100 = home.pmt.median) %>%
dplyr::filter(denominator >= 500 & complete.cases(.))
summary(testing.sites.acs.complete)
length(unique(testing.sites.acs.complete$GEOID))

I join the census tract GEOIDs to the daily PM2.5 data, then clean and summarize the PM2.5 data by tract. Tracts with fewer than 50 days of valid data are excluded.
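For context on the thresholds used in the summarization below: 12.0 µg/m³ was, at the time of these data, the upper bound of the EPA’s “Good” AQI category for 24-hour PM2.5, and the flag10/flag20 comparisons multiply through rather than divide. A small base-R sketch of that logic on made-up daily means:

```r
# Sketch: day-level "ungreen" flag and the division-free percentage flag.
daily_means  <- c(5.2, 11.9, 12.0, 12.1, 35.4)  # made-up daily means
ungreen_days <- sum(daily_means > 12.0)          # strictly above 12.0 -> 2 days
total_days   <- length(daily_means)
100 * ungreen_days / total_days                  # percent ungreen: 40
# The flag logic below multiplies through instead of dividing:
# ungreen_days * 10 > total_days  <=>  ungreen_days / total_days > 10%
ungreen_days * 10 > total_days                   # TRUE
```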
summary(clean.2019.dailyPM2.5)
# Create a file of site IDs with GEOIDs that have complete ACS data.
site_id.GEOID.xwalk <- testing.sites.acs.complete %>%
dplyr::select(site_id,GEOID) %>%
unique()
GEOID.dailyPM2.5 <- inner_join( site_id.GEOID.xwalk
,clean.2019.dailyPM2.5
,by="site_id")
dim(GEOID.dailyPM2.5)
# 1,396,734 rows
dim(unique(GEOID.dailyPM2.5[,c("GEOID","date_local")]))
# 243,809 tract-days
table(GEOID.dailyPM2.5$event_type, useNA="always")
# Included, Excluded, None
table(GEOID.dailyPM2.5$validity_indicator, useNA="always")
table(GEOID.dailyPM2.5$validity_indicator, GEOID.dailyPM2.5$event_type, useNA="always")
# Y, N. 4282 x N
# Excluding validity_indicator == "N" does not get rid of a meaningful % of events
# GEOID.dailyPM2.5 %>% dplyr::filter(validity_indicator == "N") %>% View()
summary(GEOID.dailyPM2.5$arithmetic_mean)
# Create preliminary outcome variables
GEOID.daily.summary <-
GEOID.dailyPM2.5 %>%
dplyr::filter(validity_indicator=="Y") %>%
mutate(arithmetic_mean = case_when(arithmetic_mean < 0 ~ 0, TRUE ~ arithmetic_mean)) %>%
group_by(GEOID,date_local) %>%
summarise(mean = mean(arithmetic_mean),.groups = "keep")
# table(GEOID.daily.summary$any_event)
summary(GEOID.daily.summary$mean)
GEOID.AQI.outcomes <-
GEOID.daily.summary %>%
mutate(ungreen = case_when(mean > 12.0 ~ 1L, TRUE ~ 0L)) %>%
group_by(GEOID) %>%
summarise( pm2.5_days = n()
,pm2.5_p50 = quantile(mean, 0.50)
,pm2.5_ungreen_days = sum(ungreen)
,.groups="keep")
head(GEOID.AQI.outcomes)
summary(GEOID.AQI.outcomes)
# How many days of valid data do we have per tract?
table(GEOID.AQI.outcomes$pm2.5_days)
# Create final PM2.5 outcome variables
AQI.outcomes.final <-
GEOID.AQI.outcomes %>%
dplyr::filter(pm2.5_days >= 50) %>%
mutate( pm2.5_pct_ungreen = 100 * pm2.5_ungreen_days / pm2.5_days
,pm2.5_flag10_ungreen = case_when(pm2.5_ungreen_days * 10 > pm2.5_days ~ 1L, TRUE ~ 0L)
,pm2.5_flag20_ungreen = case_when(pm2.5_ungreen_days * 5 > pm2.5_days ~ 1L, TRUE ~ 0L)
) %>%
mutate( pm2.5_flag10_ungreen = factor(pm2.5_flag10_ungreen
,levels = c(0,1)
,labels = c("0-10%",">10%"))
,pm2.5_flag20_ungreen = factor(pm2.5_flag20_ungreen
,levels = c(0,1)
,labels = c("0-20%",">20%"))) %>%
dplyr::select(GEOID, pm2.5_p50, pm2.5_pct_ungreen, pm2.5_flag10_ungreen, pm2.5_flag20_ungreen)
# Review created outcome variables
head(AQI.outcomes.final)
summary(AQI.outcomes.final$pm2.5_p50)
summary(AQI.outcomes.final$pm2.5_pct_ungreen)
table(AQI.outcomes.final$pm2.5_flag10_ungreen)
table(AQI.outcomes.final$pm2.5_flag20_ungreen)
AQI.outcomes.final %>% dplyr::filter(pm2.5_flag10_ungreen==">10%") %>% summary()
AQI.outcomes.final %>% dplyr::filter(pm2.5_flag20_ungreen==">20%") %>% summary()

With both predictors and outcomes summarized by census tract, the ACS and AQS files can now be joined to create the final data set.
str(AQI.outcomes.final)
str(testing.sites.acs.complete)
# Drop unnecessary geographic variables from the ACS file
acs.joinready <-
testing.sites.acs.complete %>%
dplyr::select(-site_id, -STATEFP, -COUNTYFP) %>%
unique()
dim(acs.joinready)
length(unique(acs.joinready$GEOID))
# 920 rows, unique on GEOID
# Join PM2.5 and ACS data by tract on GEOID
AQI.ACS <- inner_join(AQI.outcomes.final,acs.joinready,by="GEOID")
head(AQI.ACS)
# Prepare the geography/geometry file for linkage
str(testing.sites.tracts)
geom.joinready <-
testing.sites.tracts %>%
dplyr::select(GEOID,NAMELSAD,STATEFP,state,COUNTYFP,county,city,cbsa,cbsa_code,geometry) %>%
unique()
dim(geom.joinready)
length(unique(geom.joinready$GEOID))
geom.joinready %>% group_by(GEOID) %>% summarise(obs=n(),.groups="keep") %>% arrange(-obs) %>% head()
geom.joinready %>% dplyr::filter(GEOID=="06079012304")
# Hard-code one value to be able to keep city attribute unique by GEOID
geom.joinready2 <-
geom.joinready %>%
mutate(city = case_when(GEOID=="06079012304" ~ "Arroyo Grande", TRUE ~ city)) %>%
unique()
dim(geom.joinready2)
length(unique(geom.joinready2$GEOID))
head(geom.joinready2)
# Join geography/geometry file with AQS outcomes and ACS variables
AQI.ACS.geom <- inner_join(geom.joinready2,AQI.ACS,by="GEOID") %>% ungroup()
head(AQI.ACS.geom)
class(AQI.ACS.geom)

Describe your results and include relevant tables, plots, and code/comments used to obtain them. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.
TODO: Perform bivariate tests of community characteristics vs. air quality (continuous measures and a 1/0 indicator for poor air quality above a threshold). Create multivariate models to predict air quality from community characteristics. Construct a classifier using community characteristics and evaluate its performance against the real data.
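As a sketch of the planned modeling steps (simulated data and placeholder names only — poverty.12mo and flag_ungreen here are generated, not the real columns from AQI.ACS.geom):

```r
set.seed(42)
# Simulate one community characteristic and a loosely correlated PM2.5 summary
toy <- data.frame(poverty.12mo = runif(200, 0, 40))
toy$pm2.5_p50 <- 7 + 0.05 * toy$poverty.12mo + rnorm(200)
toy$flag_ungreen <- as.integer(toy$pm2.5_p50 > quantile(toy$pm2.5_p50, 0.8))
# Bivariate test: continuous outcome vs. one community characteristic
cor.test(toy$poverty.12mo, toy$pm2.5_p50)
# Classifier step: logistic regression on the binary poor-air-quality flag
fit <- glm(flag_ungreen ~ poverty.12mo, data = toy, family = binomial)
summary(fit)
# Naive in-sample evaluation of the classifier
pred <- as.integer(predict(fit, type = "response") > 0.5)
mean(pred == toy$flag_ungreen)
```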
Continuous PM2.5 outcomes by tract, mapped:
AQI.ACS.geom.WGS84 <- st_transform(AQI.ACS.geom, 4326) # transform from NAD83 (EPSG:4269) to WGS84 (EPSG:4326), which leaflet expects
palette <- colorNumeric("viridis", NULL, reverse=TRUE)
# Median PM2.5 by tract - interactive
popup_msg <- paste0(AQI.ACS.geom.WGS84$county," County, ",AQI.ACS.geom.WGS84$state,", ",AQI.ACS.geom.WGS84$NAMELSAD
,"<br>Metro Area: ",AQI.ACS.geom.WGS84$cbsa
,"<br>Median PM2.5: ",round(AQI.ACS.geom.WGS84$pm2.5_p50,digits=2))
leaflet(AQI.ACS.geom.WGS84) %>%
  addPolygons(fillColor = ~palette(pm2.5_p50)
              ,fillOpacity = 0.9
              ,weight = 1.5
              ,color = "black"
              ,popup = popup_msg) %>%
  addTiles() %>%
  addLegend("bottomright",
            pal = palette,
            values = ~pm2.5_p50,
            title = "PM2.5",
            opacity = 1) %>%
  addScaleBar()